A quotient filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set (an approximate membership query filter, AMQ). A query will elicit a reply specifying either that the element is definitely not in the set or that the element is probably in the set. The former result is definitive; i.e., the test does not generate false negatives. But with the latter result there is some probability, ε, of the test returning "element is in the set" when in fact the element is not present in the set ( i.e., a false positive). There is a tradeoff between ε, the false positive rate, and storage size; increasing the filter's storage size reduces ε. Other AMQ operations include "insert" and "optionally delete". The more elements are added to the set, the larger the probability of false positives.
A typical application for quotient filters, and other AMQ filters, is to serve as a proxy for the keys in a database on disk. As keys are added to or removed from the database, the filter is updated to reflect this. Any lookup will first consult the fast quotient filter, then look in the (presumably much slower) database only if the quotient filter reported the presence of the key. If the filter returns absence, the key is known not to be in the database without any disk accesses having been performed.
A quotient filter has the usual AMQ operations of insert and query. In addition it can also be merged and re-sized without having to re-hash the original keys (thereby avoiding the need to access those keys from secondary storage). This property benefits certain kinds of log-structured merge-trees.
For some key d which hashes to the fingerprint dH, let its quotient be dQ and the remainder be dR. QF will try to store the remainder in slot dQ, which is known as the canonical slot. However the canonical slot might already be occupied because multiple keys can hash to the same fingerprint—a hard collision—or because even when the keys' fingerprints are distinct they can have the same quotient—a soft collision. If the canonical slot is occupied then the remainder is stored in some slot to the right.
As described below, the insertion algorithm ensures that all fingerprints having the same quotient are stored in contiguous slots. Such a set of fingerprints is defined as a run. Note that a run's first fingerprint might not occupy its canonical slot if the run has been forced right by some run to the left.
However a run whose first fingerprint occupies its canonical slot indicates the start of a cluster. The initial run and all subsequent runs comprise the cluster, which terminates at an unoccupied slot or the start of another cluster.
The three additional bits are used to reconstruct a slot's fingerprint. They have the following function:
The various combinations have the following meaning:
Empty Slot |
Slot is holding start of run that has been shifted from its canonical slot. |
not used. |
Slot is holding continuation of run that has been shifted from its canonical slot. |
Slot is holding start of run that is in its canonical slot. This is also the start of the cluster. |
Slot is holding start of run that has been shifted from its canonical slot. Also the run for which this is the canonical slot exists but is shifted right. |
not used. |
Slot is holding continuation of run that has been shifted from its canonical slot. Also the run for which this is the canonical slot exists but is shifted right. |
We hash the key to produce its fingerprint, dH, which we then partition into its high-order q bits, dQ, which comprise its quotient, and its low-order r bits, dR, which comprise its remainder. Slot dQ is the key's canonical slot. That slot is empty if its three meta-data bits are false. In that case the filter does not contain the key.
If the canonical slot is occupied then we must locate the quotient's run. The set of slots that hold remainders belonging to the same quotient are stored contiguously and these comprise the quotient's run. The first slot in the run might be the canonical slot but it is also possible that the entire run has been shifted to the right by the encroachment from the left of another run.
To locate the quotient's run we must first locate the start of the cluster. The cluster consists of a contiguous set of runs. Starting with the quotient's canonical slot we can scan left to locate the start of the cluster, then scan right to locate the quotient's run.
We scan left looking for a slot with is_shifted is false. This indicates the start of the cluster. Then we scan right keeping a running count of the number of runs we must skip over. Each slot to the left of the canonical slot having is_occupied set indicates another run to be skipped, so we increment the running count. Each slot having is_continuation clear indicates the start of another run, thus the end of the previous run, so we decrement the running count. When the running count reaches zero, we are scanning the quotient's run. We can compare the remainder in each slot in the run with dR. If found, we report that the key is (probably) in the filter otherwise we report that the key is definitely not in the filter.
In state 2 elements c and d have been added. Element c has a quotient of 1, the same as b. We assume bR < cR so cR is shifted into slot 2, and is marked as both a continuation and shifted. Element d has a quotient of 2. Since its canonical slot is in use, it is shifted into slot 3, and is marked as shifted. In addition its canonical slot is marked as occupied. The runs for quotients 1 and 2 now comprise a cluster.
In state 3 element a has been added. Its quotient is 1. We assume aR < bR so the remainders in slots 1 through 4 must be shifted. Slot 2 receives bR and is marked as a continuation and shifted. Slot 5 receives eR and is marked as shifted. The runs for quotients 1, 2 and 4 now comprise a cluster, and the presence of those three runs in the cluster is indicated by having slots 1, 2 and 4 being marked as occupied.
Quotient filters offer two benefits in some applications.
The space used by quotient filters is comparable to that of Bloom filters. However quotient filters can be efficiently merged within memory without having to re-insert the original keys.
This is particularly important in some log structured storage systems that use the log-structured merge-tree or LSM-tree. The LSM-tree is actually a collection of trees but which is treated as a single key-value store. One variation of the LSM-Tree is the Sorted Array Merge Tree or SAMT. In this variation, a SAMT's component trees are called . Each Wanna- B-tree has an associated quotient filter. A query on the SAMT is directed at only select Wanna- B-trees as evidenced by their quotient filters.
The storage system in its normal operation compacts the SAMT's Wanna- B-trees, merging smaller Wanna- B-trees into larger ones and merging their quotient filters. An essential property of quotient filters is that they can be efficiently merged without having to re-insert the original keys. Given that for large data sets the Wanna- B-trees may not be in memory, accessing them to retrieve the original keys would incur many I/Os.
By construction the values in a quotient filter are stored in sorted order. Each run is associated with a specific quotient value, which provides the most significant portion of the fingerprint, the runs are stored in order and each slot in the run provides the least significant portion of the fingerprint.
So, by working from left to right, one can reconstruct all the fingerprints and the resulting list of integers will be in sorted order. Merging two quotient filters is then a simple matter of converting each quotient filter into such a list, merging the two lists and using it to populate a new larger quotient filter. Similarly, we can halve or double the size of a quotient filter without rehashing the keys since the fingerprints can be recomputed using just the quotients and remainders.
|
|